Supervised Learning Classification Project: AllLife Bank Personal Loan Campaign¶

Problem Statement¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This success has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the conversion ratio.

As a Data Scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Importing necessary libraries¶

In [ ]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Import the metrics class from sklearn
from sklearn import metrics
# Import the variance_inflation_factor class from statsmodels.stats.outliers_influence
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Import the train_test_split class from sklearn.model_selection
from sklearn.model_selection import train_test_split
# Import the DecisionTreeClassifier class from sklearn.tree
from sklearn.tree import DecisionTreeClassifier
# Import the tree module from sklearn
from sklearn import tree
# Import the GridSearchCV class from sklearn.model_selection
from sklearn.model_selection import GridSearchCV
# Import the f1_score, accuracy_score, recall_score, precision_score, confusion_matrix, ConfusionMatrixDisplay, and make_scorer functions from sklearn.metrics
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    make_scorer,
    roc_auc_score,
    roc_curve,
    precision_recall_curve,
)

Loading the dataset¶

In [ ]:
# Load the dataset
df = pd.read_csv('Loan_Modelling.csv')

Data Overview¶

  • Observations
  • Sanity checks
In [ ]:
df.head()
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [ ]:
df.tail()
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1
In [ ]:
print(f"Number of rows: {df.shape[0]}, Number of columns: {df.shape[1]}")
Number of rows: 5000, Number of columns: 14
In [ ]:
# Data frame information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

Observations¶

  • There are 5000 rows and 14 columns in the DataFrame.
  • All columns are of type int64, except CCAvg which is float64.
In [ ]:
# Count the duplicated  data
df[df.duplicated()].count()
Out[ ]:
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64

Observations¶

  • There are no duplicated values in the dataset
In [ ]:
# Count null values
df.isnull().sum()
Out[ ]:
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64

Observations¶

  • There are no missing values in the data.

Exploratory Data Analysis¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of the Mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?
In [ ]:
# The describe() method returns a DataFrame that contains descriptive statistics for each column of the DataFrame.
df.describe(include='all').T
Out[ ]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0

Observations¶

  • ID: Identifier only; summary statistics are not meaningful.
  • Age: The average age is 45.34 years, with a standard deviation of 11.46. The minimum age is 23, and the maximum age is 67.
  • Experience: The average work experience is 20.10 years, with a standard deviation of 11.47. The minimum experience is -3 (which might indicate an error or missing data), and the maximum experience is 43.
  • Income: The average income is 73.77, with a standard deviation of 46.03. The minimum income is 8, and the maximum income is 224.
  • ZIPCode: A location code rather than a true numeric quantity; summary statistics are not meaningful.
  • Family: The average family size is 2.40, with a standard deviation of 1.15. The minimum family size is 1, and the maximum family size is 4.
  • CCAvg: The average credit card spending per month is 1.94, with a standard deviation of 1.75. The minimum spending is 0, and the maximum spending is 10.
  • Education: The average education level is 1.88, with a standard deviation of 0.84. The education level ranges from 1 to 3.
  • Mortgage: The average mortgage amount is 56.50, with a standard deviation of 101.71. The minimum mortgage is 0, and the maximum mortgage is 635.
  • Personal_Loan: The percentage of individuals with personal loans is 9.6%. The variable is binary, with 0 indicating no personal loan and 1 indicating a personal loan.
  • Securities_Account: The percentage of individuals with a securities account is 10.4%. The variable is binary, with 0 indicating no securities account and 1 indicating a securities account.
  • CD_Account: The percentage of individuals with a certificate of deposit (CD) account is 6.04%. The variable is binary, with 0 indicating no CD account and 1 indicating a CD account.
  • Online: The percentage of individuals using online banking is 59.68%. The variable is binary, with 0 indicating no online banking and 1 indicating online banking.
  • CreditCard: The percentage of individuals with a credit card is 29.4%. The variable is binary, with 0 indicating no credit card and 1 indicating a credit card.
  1. What is the distribution of the Mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
In [ ]:
# Generating a box plot
plt.figure(figsize=(15, 7))
sns.boxplot(data=df, x='Mortgage')
Out[ ]:
<Axes: xlabel='Mortgage'>

Observation

Mortgage is heavily right-skewed: most customers have no mortgage, and the upper tail contains a large share of IQR outliers.
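The skew and the share of box-plot outliers can be quantified rather than eyeballed. A minimal sketch using the 1.5 * IQR rule, run here on a synthetic Mortgage-like series (mostly zeros plus a long right tail) since only the pattern matters; on the real data you would pass df['Mortgage']:

```python
import numpy as np
import pandas as pd

def iqr_outlier_share(s: pd.Series) -> float:
    """Fraction of values outside the 1.5 * IQR whiskers of a box plot."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    outside = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    return float(outside.mean())

# Synthetic stand-in: ~70% zeros with an exponential right tail, like Mortgage
rng = np.random.default_rng(0)
values = np.where(rng.random(5000) < 0.7, 0.0, rng.exponential(100, size=5000))
s = pd.Series(values)

print(f"skewness = {s.skew():.2f}, outlier share = {iqr_outlier_share(s):.2%}")
```

A strongly positive skewness together with a double-digit outlier share confirms the visual impression from the box plot.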

  2. How many customers have credit cards?
In [ ]:
# Customers who either spend on a credit card or hold one from another bank
use_a_credit_card = df[df['CCAvg'] != 0]
has_a_credit_card = df[df['CreditCard'] == 1]
combined_list = set(use_a_credit_card.index).union(has_a_credit_card.index)

print("Customers with a credit card:", len(combined_list))
Customers with a credit card: 4922
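The union above can also be expressed as a single boolean mask, avoiding the intermediate DataFrames. A sketch on a tiny synthetic frame (note that CreditCard flags a card from another bank, while CCAvg > 0 captures spending on any card, which is why the two conditions are combined):

```python
import pandas as pd

demo = pd.DataFrame({
    "CCAvg":      [1.6, 0.0, 0.0, 2.7, 0.0],
    "CreditCard": [0,   1,   0,   1,   0],
})

# A customer counts if they spend on any credit card OR hold another bank's card
has_card = (demo["CCAvg"] > 0) | (demo["CreditCard"] == 1)
n_with_card = int(has_card.sum())
```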
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
In [ ]:
# Plot a heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(data=df.corr(), annot=True, cmap='YlGnBu', vmin=-0.2, vmax=1)
Out[ ]:
<Axes: >

Observations¶

In this case, the correlation matrix shows that the following variables are most closely related:

  • Personal_Loan and Income (correlation coefficient of 0.50)
  • Personal_Loan and CCAvg (correlation coefficient of 0.37)
  • Personal_Loan and CD_Account (correlation coefficient of 0.32)
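Reading coefficients off the heatmap is error-prone; sorting the correlations numerically is more reliable. A sketch with a small synthetic frame (on the real data this would be target_correlations(df, 'Personal_Loan')):

```python
import numpy as np
import pandas as pd

def target_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlations with the target column, strongest (by |r|) first."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic example: y depends on column a but not on column b
rng = np.random.default_rng(42)
a, b = rng.normal(size=500), rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": b, "y": a + 0.1 * rng.normal(size=500)})

ranked = target_correlations(demo, "y")
```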
In [ ]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    tab1 = pd.crosstab(data[predictor], data[target], margins=True)
    print(tab1)
    tab = pd.crosstab(data[predictor], data[target], normalize="index")
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
  4. How does a customer's interest in purchasing a loan vary with their age?
In [ ]:
stacked_barplot(df, 'Age', 'Personal_Loan')

Observation¶

Earning power tends to grow with age and experience, which could make older customers better able to afford a loan; however, the chart shows the loan purchase rate varying only mildly across age groups, consistent with the weak Age correlation in the heatmap above.

  5. How does a customer's interest in purchasing a loan vary with their education?
In [ ]:
stacked_barplot(df, 'Education', 'Personal_Loan')

Observations¶

  • Customers with higher education levels (Graduate and Advanced/Professional) show a noticeably higher loan purchase rate.
  • Customers with only undergraduate education purchase loans less often, possibly because they have fewer financial resources or less need for credit.

Data Preprocessing¶

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)
In [ ]:
# Unique values of all the columns to check values

for column in df.columns:
    print('-'*20)
    print(column)
    print(df[column].unique())
--------------------
ID
[   1    2    3 ... 4998 4999 5000]
--------------------
Age
[25 45 39 35 37 53 50 34 65 29 48 59 67 60 38 42 46 55 56 57 44 36 43 40
 30 31 51 32 61 41 28 49 47 62 58 54 33 27 66 24 52 26 64 63 23]
--------------------
Experience
[ 1 19 15  9  8 13 27 24 10 39  5 23 32 41 30 14 18 21 28 31 11 16 20 35
  6 25  7 12 26 37 17  2 36 29  3 22 -1 34  0 38 40 33  4 -2 42 -3 43]
--------------------
Income
[ 49  34  11 100  45  29  72  22  81 180 105 114  40 112 130 193  21  25
  63  62  43 152  83 158  48 119  35  41  18  50 121  71 141  80  84  60
 132 104  52 194   8 131 190  44 139  93 188  39 125  32  20 115  69  85
 135  12 133  19  82 109  42  78  51 113 118  64 161  94  15  74  30  38
   9  92  61  73  70 149  98 128  31  58  54 124 163  24  79 134  23  13
 138 171 168  65  10 148 159 169 144 165  59  68  91 172  55 155  53  89
  28  75 170 120  99 111  33 129 122 150 195 110 101 191 140 153 173 174
  90 179 145 200 183 182  88 160 205 164  14 175 103 108 185 204 154 102
 192 202 162 142  95 184 181 143 123 178 198 201 203 189 151 199 224 218]
--------------------
ZIPCode
[91107 90089 94720 94112 91330 92121 91711 93943 93023 94710 90277 93106
 94920 91741 95054 95010 94305 91604 94015 90095 91320 95521 95064 90064
 94539 94104 94117 94801 94035 92647 95814 94114 94115 92672 94122 90019
 95616 94065 95014 91380 95747 92373 92093 94005 90245 95819 94022 90404
 93407 94523 90024 91360 95670 95123 90045 91335 93907 92007 94606 94611
 94901 92220 93305 95134 94612 92507 91730 94501 94303 94105 94550 92612
 95617 92374 94080 94608 93555 93311 94704 92717 92037 95136 94542 94143
 91775 92703 92354 92024 92831 92833 94304 90057 92130 91301 92096 92646
 92182 92131 93720 90840 95035 93010 94928 95831 91770 90007 94102 91423
 93955 94107 92834 93117 94551 94596 94025 94545 95053 90036 91125 95120
 94706 95827 90503 90250 95817 95503 93111 94132 95818 91942 90401 93524
 95133 92173 94043 92521 92122 93118 92697 94577 91345 94123 92152 91355
 94609 94306 96150 94110 94707 91326 90291 92807 95051 94085 92677 92614
 92626 94583 92103 92691 92407 90504 94002 95039 94063 94923 95023 90058
 92126 94118 90029 92806 94806 92110 94536 90623 92069 92843 92120 95605
 90740 91207 95929 93437 90630 90034 90266 95630 93657 92038 91304 92606
 92192 90745 95060 94301 92692 92101 94610 90254 94590 92028 92054 92029
 93105 91941 92346 94402 94618 94904 93077 95482 91709 91311 94509 92866
 91745 94111 94309 90073 92333 90505 94998 94086 94709 95825 90509 93108
 94588 91706 92109 92068 95841 92123 91342 90232 92634 91006 91768 90028
 92008 95112 92154 92115 92177 90640 94607 92780 90009 92518 91007 93014
 94024 90027 95207 90717 94534 94010 91614 94234 90210 95020 92870 92124
 90049 94521 95678 95045 92653 92821 90025 92835 91910 94701 91129 90071
 96651 94960 91902 90033 95621 90037 90005 93940 91109 93009 93561 95126
 94109 93107 94591 92251 92648 92709 91754 92009 96064 91103 91030 90066
 95403 91016 95348 91950 95822 94538 92056 93063 91040 92661 94061 95758
 96091 94066 94939 95138 95762 92064 94708 92106 92116 91302 90048 90405
 92325 91116 92868 90638 90747 93611 95833 91605 92675 90650 95820 90018
 93711 95973 92886 95812 91203 91105 95008 90016 90035 92129 90720 94949
 90041 95003 95192 91101 94126 90230 93101 91365 91367 91763 92660 92104
 91361 90011 90032 95354 94546 92673 95741 95351 92399 90274 94087 90044
 94131 94124 95032 90212 93109 94019 95828 90086 94555 93033 93022 91343
 91911 94803 94553 95211 90304 92084 90601 92704 92350 94705 93401 90502
 94571 95070 92735 95037 95135 94028 96003 91024 90065 95405 95370 93727
 92867 95821 94566 95125 94526 94604 96008 93065 96001 95006 90639 92630
 95307 91801 94302 91710 93950 90059 94108 94558 93933 92161 94507 94575
 95449 93403 93460 95005 93302 94040 91401 95816 92624 95131 94965 91784
 91765 90280 95422 95518 95193 92694 90275 90272 91791 92705 91773 93003
 90755 96145 94703 96094 95842 94116 90068 94970 90813 94404 94598]
--------------------
Family
[4 3 1 2]
--------------------
CCAvg
[ 1.6   1.5   1.    2.7   0.4   0.3   0.6   8.9   2.4   0.1   3.8   2.5
  2.    4.7   8.1   0.5   0.9   1.2   0.7   3.9   0.2   2.2   3.3   1.8
  2.9   1.4   5.    2.3   1.1   5.7   4.5   2.1   8.    1.7   0.    2.8
  3.5   4.    2.6   1.3   5.6   5.2   3.    4.6   3.6   7.2   1.75  7.4
  2.67  7.5   6.5   7.8   7.9   4.1   1.9   4.3   6.8   5.1   3.1   0.8
  3.7   6.2   0.75  2.33  4.9   0.67  3.2   5.5   6.9   4.33  7.3   4.2
  4.4   6.1   6.33  6.6   5.3   3.4   7.    6.3   8.3   6.    1.67  8.6
  7.6   6.4  10.    5.9   5.4   8.8   1.33  9.    6.7   4.25  6.67  5.8
  4.8   3.25  5.67  8.5   4.75  4.67  3.67  8.2   3.33  5.33  9.3   2.75]
--------------------
Education
[1 2 3]
--------------------
Mortgage
[  0 155 104 134 111 260 163 159  97 122 193 198 285 412 153 211 207 240
 455 112 336 132 118 174 126 236 166 136 309 103 366 101 251 276 161 149
 188 116 135 244 164  81 315 140  95  89  90 105 100 282 209 249  91  98
 145 150 169 280  99  78 264 113 117 325 121 138  77 158 109 131 391  88
 129 196 617 123 167 190 248  82 402 360 392 185 419 270 148 466 175 147
 220 133 182 290 125 124 224 141 119 139 115 458 172 156 547 470 304 221
 108 179 271 378 176  76 314  87 203 180 230 137 152 485 300 272 144  94
 208 275  83 218 327 322 205 227 239  85 160 364 449  75 107  92 187 355
 106 587 214 307 263 310 127 252 170 265 177 305 372  79 301 232 289 212
 250  84 130 303 256 259 204 524 157 231 287 247 333 229 357 361 294  86
 329 142 184 442 233 215 394 475 197 228 297 128 241 437 178 428 162 234
 257 219 337 382 397 181 120 380 200 433 222 483 154 171 146 110 201 277
 268 237 102  93 354 195 194 238 226 318 342 266 114 245 341 421 359 565
 319 151 267 601 567 352 284 199  80 334 389 186 246 589 242 143 323 535
 293 398 343 255 311 446 223 262 422 192 217 168 299 505 400 165 183 326
 298 569 374 216 191 408 406 452 432 312 477 396 582 358 213 467 331 295
 235 635 385 328 522 496 415 461 344 206 368 321 296 373 292 383 427 189
 202  96 429 431 286 508 210 416 553 403 225 500 313 410 273 381 330 345
 253 258 351 353 308 278 464 509 243 173 481 281 306 577 302 405 571 581
 550 283 612 590 541]
--------------------
Personal_Loan
[0 1]
--------------------
Securities_Account
[1 0]
--------------------
CD_Account
[0 1]
--------------------
Online
[0 1]
--------------------
CreditCard
[0 1]

The ID column is just a row identifier (the same information as the index), so it is not relevant for modeling and can be dropped.

In [ ]:
df = df.drop('ID', axis=1)

We can observe some invalid negative values in the Experience column:

In [ ]:
df[df['Experience'] < 0]['Experience'].unique()
Out[ ]:
array([-1, -2, -3])
In [ ]:
# Correcting the invalid negative values
df['Experience'] = df['Experience'].replace({-1: 1, -2: 2, -3: 3})
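Because the only invalid values are -1, -2 and -3, replacing each with its positive counterpart is equivalent to taking the absolute value. A quick sketch on a synthetic Experience-like series confirming the equivalence:

```python
import pandas as pd

exp = pd.Series([1, 19, -1, -2, -3, 0, 43])  # synthetic Experience-like values

fixed_replace = exp.replace({-1: 1, -2: 2, -3: 3})  # explicit mapping
fixed_abs = exp.abs()                               # one-line equivalent here
```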

Data distribution

In [ ]:
# Print the data distribution of each column of the dataframe
for column in df.columns:
    plt.figure(figsize=(15, 7))
    sns.histplot(data=df, x=column, kde=True)

Outliers

In [ ]:
# Print the outliers distribution
for column in ['Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg', 'Mortgage']:
    plt.figure(figsize=(15, 7))
    sns.boxplot(data=df, x=column)

Outliers treatment

In [ ]:
# Find the first quartile, the third quartile, and the interquartile range
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)

# Interquartile Range (75th percentile - 25th percentile)
IQR = Q3 - Q1

# Define the lower and upper limits of the normal data range
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

# Percentage of outliers by column
((df < lower_limit) | (df > upper_limit)).sum() / len(df) * 100
Out[ ]:
Age                    0.00
Experience             0.00
Income                 1.92
ZIPCode                0.00
Family                 0.00
CCAvg                  6.48
Education              0.00
Mortgage               5.82
Personal_Loan          9.60
Securities_Account    10.44
CD_Account             6.04
Online                 0.00
CreditCard             0.00
dtype: float64

Personal_Loan, Securities_Account, and CD_Account are binary Yes/No columns, so the IQR rule simply flags the minority class; these values are not treated as outliers.

Mortgage and CCAvg have a high percentage of outliers.

Removing outliers from Mortgage column

In [ ]:
# Identify any data points that fall outside the normal data range
outliersMortgage = df[(df['Mortgage'] < lower_limit['Mortgage']) |
                      (df['Mortgage'] > upper_limit['Mortgage'])]
# Remove the outliers from the dataframe
df = df.drop(outliersMortgage.index, axis=0)
df.shape
Out[ ]:
(4709, 13)
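Dropping the flagged rows removes about 6% of the customers, including some loan takers. An alternative worth considering is capping (winsorizing) at the IQR whiskers, which keeps every row; a sketch on a small synthetic series:

```python
import pandas as pd

def cap_iqr(s: pd.Series) -> pd.Series:
    """Clip values to the 1.5 * IQR whiskers instead of dropping rows."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

s = pd.Series([0, 0, 0, 100, 150, 635])  # Mortgage-like values
capped = cap_iqr(s)
```

Whether to cap or drop is a judgment call; capping preserves the sample size and the class balance, at the cost of distorting the tail values.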

Checking the outliers percentage

In [ ]:
# Percentage of outliers by column
((df < lower_limit) | (df > upper_limit)).sum() / len(df) * 100
Out[ ]:
Age                    0.000000
Experience             0.000000
Income                 1.571459
ZIPCode                0.000000
Family                 0.000000
CCAvg                  5.776173
Education              0.000000
Mortgage               0.000000
Personal_Loan          8.218305
Securities_Account    10.469314
CD_Account             5.415162
Online                 0.000000
CreditCard             0.000000
dtype: float64

CCAvg still has a high percentage of outliers

Removing the outliers from CCAvg column

In [ ]:
# Identify any data points that fall outside the normal data range
outliersCCAvg = df[(df['CCAvg'] < lower_limit['CCAvg']) |
                   (df['CCAvg'] > upper_limit['CCAvg'])]
# Remove the outliers from the dataframe
df = df.drop(outliersCCAvg.index, axis=0)
df.shape
Out[ ]:
(4437, 13)
In [ ]:
# Percentage of outliers by column
((df < lower_limit) | (df > upper_limit)).sum() / len(df) * 100
Out[ ]:
Age                    0.000000
Experience             0.000000
Income                 0.878972
ZIPCode                0.000000
Family                 0.000000
CCAvg                  0.000000
Education              0.000000
Mortgage               0.000000
Personal_Loan          6.558485
Securities_Account    10.434979
CD_Account             4.778003
Online                 0.000000
CreditCard             0.000000
dtype: float64

Data preparation for modeling

In [ ]:
# Separate independent and dependent variable
x = df.drop(["Personal_Loan"], axis=1)
y = df["Personal_Loan"]

# Splitting data into training and test set:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1)

# Printing information about train and test data
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
print("Percentage of classes in training set:",
      y_train.value_counts(normalize=True))
print("Percentage of classes in test set:",
      y_test.value_counts(normalize=True))
Number of rows in train data = 3105
Number of rows in test data = 1332
Percentage of classes in training set: Personal_Loan
0    0.937842
1    0.062158
Name: proportion, dtype: float64
Percentage of classes in test set: Personal_Loan
0    0.926426
1    0.073574
Name: proportion, dtype: float64
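The class shares above differ slightly between train (6.2% positives) and test (7.4%). With an imbalanced target, passing stratify=y to train_test_split keeps the positive rate (nearly) identical in both splits; a sketch on synthetic data of the same shape:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 4437 rows, ~7% positives, like Personal_Loan after cleaning
rng = np.random.default_rng(1)
X = rng.normal(size=(4437, 3))
y = (rng.random(4437) < 0.07).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(f"train positives: {y_tr.mean():.4f}, test positives: {y_te.mean():.4f}")
```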

Model Building¶

Model Evaluation Criterion¶

Model can make wrong predictions as:¶

  1. Predicting a customer will take a loan when they actually will not

  2. Predicting a customer will not take a loan when they actually will

Which case is more important?¶

  • Case 2 is more important: predicting that a customer will not take a loan when they actually would means the bank misses a potential borrower and the interest income the campaign is designed to capture.

How to reduce this loss i.e need to reduce False Negatives?¶

  • Recall should be maximized: the greater the recall, the lower the number of false negatives, i.e. the fewer potential customers the campaign misses.
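Concretely, recall = TP / (TP + FN), so each avoided false negative raises recall directly. A small worked example:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]  # one false negative, one false positive

# sklearn orders the flattened 2x2 confusion matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)  # tp / (tp + fn) = 3 / 4
```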

Model Building: Logistic Regression¶

In [ ]:
# fitting logistic regression model
logit = sm.Logit(y_train, x_train)
lg = logit.fit(disp=False)

print(lg.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:          Personal_Loan   No. Observations:                 3105
Model:                          Logit   Df Residuals:                     3093
Method:                           MLE   Df Model:                           11
Date:                Sat, 10 Jun 2023   Pseudo R-squ.:                  0.6512
Time:                        00:00:09   Log-Likelihood:                -252.23
converged:                       True   LL-Null:                       -723.04
Covariance Type:            nonrobust   LLR p-value:                6.929e-195
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Age                   -0.2454      0.096     -2.556      0.011      -0.434      -0.057
Experience             0.2358      0.096      2.460      0.014       0.048       0.424
Income                 0.0600      0.004     13.903      0.000       0.052       0.068
ZIPCode            -9.664e-05   2.69e-05     -3.591      0.000      -0.000   -4.39e-05
Family                 0.6682      0.118      5.676      0.000       0.437       0.899
CCAvg                  0.8211      0.100      8.178      0.000       0.624       1.018
Education              1.7559      0.178      9.841      0.000       1.406       2.106
Mortgage              -0.0020      0.002     -1.184      0.236      -0.005       0.001
Securities_Account    -1.6790      0.485     -3.459      0.001      -2.630      -0.728
CD_Account             4.8274      0.556      8.678      0.000       3.737       5.918
Online                -0.9929      0.255     -3.901      0.000      -1.492      -0.494
CreditCard            -1.5420      0.332     -4.644      0.000      -2.193      -0.891
======================================================================================

Possibly complete quasi-separation: A fraction 0.19 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

Observations¶

  • The model has a pseudo R-squared of 0.6512. This is McFadden's measure of fit relative to the null model, not a proportion of variance explained; values in this range indicate a good fit for a logistic regression model.

  • The coefficients for all variables except Mortgage (p = 0.236) are statistically significant at the 5% level. This means these variables are associated with the probability of taking out a personal loan.

  • A negative coefficient means the probability of a person taking out a personal loan decreases as the corresponding attribute value increases.

  • A positive coefficient means the probability of a person taking out a personal loan increases as the corresponding attribute value increases.

  • p-value of a variable indicates if the variable is significant or not. If we consider the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 would be considered significant.
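Logit coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are easier to interpret (e.g. the CD_Account coefficient of 4.83 corresponds to an odds ratio of roughly exp(4.83) ≈ 125). A sketch using a few of the fitted coefficients from the summary above:

```python
import numpy as np
import pandas as pd

# Selected coefficients from the fitted Logit summary (log-odds scale)
coefs = pd.Series({
    "Income": 0.0600,
    "Education": 1.7559,
    "CD_Account": 4.8274,
    "CreditCard": -1.5420,
})
odds_ratios = np.exp(coefs)  # multiplicative change in odds per unit increase
```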

In [ ]:
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # probabilities greater than the threshold are classified as class 1
    pred = model.predict(predictors) > threshold

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1, },
        index=[0],
    )

    return df_perf

# defining a function to plot the confusion_matrix of a classification model


def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) +
             "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [ ]:
confusion_matrix_statsmodels(lg, x_train, y_train)

Train performance

In [ ]:
model_performance_classification_statsmodels(lg, x_train, y_train)
Out[ ]:
Accuracy Recall Precision F1
0 0.970048 0.647668 0.833333 0.728863

Test performance

In [ ]:
model_performance_classification_statsmodels(lg, x_test, y_test)
Out[ ]:
Accuracy Recall Precision F1
0 0.95045 0.5 0.742424 0.597561

Model Performance Improvement: Logistic Regression¶

There are different ways of detecting (or testing for) multicollinearity. One such way is using the Variance Inflation Factor (VIF).

General Rule of thumb:

  • If VIF is 1 then there is no correlation among the predictor and the remaining predictor variables, and hence the variance of $\beta_k$ is not inflated at all
  • If VIF exceeds 5, we say there is moderate multicollinearity
  • If VIF is equal or exceeding 10, it shows signs of high multi-collinearity
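VIF_k = 1 / (1 - R²_k), where R²_k comes from regressing predictor k on the remaining predictors. A sketch reproducing the formula with plain least squares on synthetic collinear data (note the auxiliary regression here includes an intercept; statsmodels' variance_inflation_factor does not add one, which may be one reason un-centered variables like Age and ZIPCode show such extreme values above):

```python
import numpy as np

def vif_manual(X: np.ndarray, k: int) -> float:
    """VIF of column k: 1 / (1 - R^2) from regressing X[:, k] on the rest."""
    y = X[:, k]
    others = np.delete(X, k, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # intercept + other columns
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + 0.05 * rng.normal(size=300)  # nearly collinear with x1
X = np.column_stack([x1, x2, x3])
```

Here the two nearly collinear columns get very large VIFs while the independent column stays near 1.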
In [ ]:
vif_series = pd.Series(
    [variance_inflation_factor(x_train.values, i)
     for i in range(x_train.shape[1])],
    index=x_train.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

Age                   1321.272418
Experience             328.767236
Income                   5.279677
ZIPCode                378.160349
Family                   5.610381
CCAvg                    3.828702
Education                6.770196
Mortgage                 1.324530
Securities_Account       1.284347
CD_Account               1.349366
Online                   2.611648
CreditCard               1.554974
dtype: float64

In [ ]:
x_train1 = x_train.drop('Age', axis=1)
x_test1 = x_test.drop('Age', axis=1)

vif_series = pd.Series(
    [variance_inflation_factor(x_train1.values, i)
     for i in range(x_train1.shape[1])],
    index=x_train1.columns,
    dtype=float,
)
print("Series after dropping Age: \n\n{}\n".format(vif_series))
Series after dropping Age: 

Experience             4.120057
Income                 5.277348
ZIPCode               22.330119
Family                 5.603378
CCAvg                  3.810217
Education              6.420808
Mortgage               1.324524
Securities_Account     1.282804
CD_Account             1.346788
Online                 2.611264
CreditCard             1.554973
dtype: float64

In [ ]:
logit1 = sm.Logit(y_train, x_train1)
lg1 = logit1.fit(disp=False)

Training performance

In [ ]:
log_reg_model_train_perf = model_performance_classification_statsmodels(
    lg1, x_train1, y_train)
log_reg_model_train_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.97037 0.65285 0.834437 0.732558

Test performance

In [ ]:
log_reg_model_test_perf = model_performance_classification_statsmodels(
    lg1, x_test1, y_test)
log_reg_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.951952 0.5 0.765625 0.604938

Observations¶

  • Dropping Age doesn't have a significant impact on the model performance.
  • Since performance is unchanged, we proceed with the reduced model (lg1).

ROC Curve and ROC-AUC

ROC-AUC on training set

In [ ]:
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(x_train1))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(x_train1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" %
         logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()

The logistic regression model performs well on the training set.

Optimal threshold using AUC-ROC curve

In [ ]:
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(x_train1))

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.03589693977783063
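The quantity `tpr - fpr` being maximized above is Youden's J statistic. A minimal numpy sketch with hypothetical ROC points (not this model's actual curve):

```python
import numpy as np

# hypothetical (fpr, tpr, threshold) points along a ROC curve
fpr = np.array([0.0, 0.1, 0.2, 0.4, 1.0])
tpr = np.array([0.0, 0.6, 0.8, 0.9, 1.0])
thresholds = np.array([1.0, 0.7, 0.5, 0.3, 0.0])

# Youden's J = TPR - FPR; the optimal cut-off maximizes it
j = tpr - fpr
optimal_threshold = thresholds[np.argmax(j)]
print(optimal_threshold)  # 0.5
```

Here the maximum of J occurs at the point farthest above the diagonal, which corresponds to the hypothetical threshold 0.5.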

Checking model performance on training set

In [ ]:
confusion_matrix_statsmodels(
    lg1, x_train1, y_train, threshold=optimal_threshold_auc_roc
)

Training performance

In [ ]:
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
    lg1, x_train1, y_train, threshold=optimal_threshold_auc_roc
)
log_reg_model_train_perf_threshold_auc_roc
Out[ ]:
Accuracy Recall Precision F1
0 0.884058 0.943005 0.34275 0.502762

Observations¶

  • Recall has increased, but the other metrics have dropped.
  • The model still performs reasonably well overall.
In [ ]:
logit_roc_auc_train = roc_auc_score(y_test, lg1.predict(x_test1))
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict(x_test1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" %
         logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
In [ ]:
# creating confusion matrix
confusion_matrix_statsmodels(
    lg1, x_test1, y_test, threshold=optimal_threshold_auc_roc)

Test performance

In [ ]:
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
    lg1, x_test1, y_test, threshold=optimal_threshold_auc_roc
)
log_reg_model_test_perf_threshold_auc_roc
Out[ ]:
Accuracy Recall Precision F1
0 0.869369 0.857143 0.344262 0.491228

Precision-Recall Curve

In [ ]:
y_scores = lg1.predict(x_train1)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)


def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()

At a threshold of about 0.35, precision and recall are balanced.
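The crossover point can also be located programmatically rather than read off the plot. A minimal numpy sketch with hypothetical scores and labels (not the notebook's data):

```python
import numpy as np

# hypothetical scores and labels, for illustration only
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90])

thresholds = np.unique(y_score)
prec, rec = [], []
for t in thresholds:
    pred = y_score >= t
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    prec.append(tp / (tp + fp) if (tp + fp) else 1.0)
    rec.append(tp / (tp + fn))
prec, rec = np.array(prec), np.array(rec)

# threshold where precision and recall are closest to each other
balanced_threshold = thresholds[np.argmin(np.abs(prec - rec))]
print(balanced_threshold)  # 0.45
```

The same `argmin(|precision - recall|)` idea applied to the notebook's `prec`, `rec`, and `tre` arrays would recover the balanced threshold without eyeballing the curve.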

In [ ]:
optimal_threshold_curve = 0.35

Checking model performance on training set

In [ ]:
confusion_matrix_statsmodels(
    lg1, x_train1, y_train, threshold=optimal_threshold_curve)

Training performance

In [ ]:
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
    lg1, x_train1, y_train, threshold=optimal_threshold_curve
)
log_reg_model_train_perf_threshold_curve
Out[ ]:
Accuracy Recall Precision F1
0 0.968116 0.746114 0.742268 0.744186

Test performance

In [ ]:
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
    lg1, x_test1, y_test, threshold=optimal_threshold_curve
)
log_reg_model_test_perf_threshold_curve
Out[ ]:
Accuracy Recall Precision F1
0 0.943694 0.561224 0.632184 0.594595
  • The model performs well on the training set at the 0.35 threshold.
  • Test performance at the 0.35 threshold is comparable to the default threshold, with a better balance between recall and precision.

Model Performance Comparison: Logistic Regression¶

In [ ]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold (0.5)",
    "Logistic Regression-0.035 Threshold",
    "Logistic Regression-0.35 Threshold",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[ ]:
Logistic Regression-default Threshold (0.5) Logistic Regression-0.035 Threshold Logistic Regression-0.35 Threshold
Accuracy 0.970370 0.884058 0.968116
Recall 0.652850 0.943005 0.746114
Precision 0.834437 0.342750 0.742268
F1 0.732558 0.502762 0.744186
In [ ]:
# Test performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold (0.5)",
    "Logistic Regression-0.035 Threshold",
    "Logistic Regression-0.35 Threshold",
]

print("Test performance comparison:")
models_train_comp_df
Test performance comparison:
Out[ ]:
Logistic Regression-default Threshold (0.5) Logistic Regression-0.035 Threshold Logistic Regression-0.35 Threshold
Accuracy 0.951952 0.869369 0.943694
Recall 0.500000 0.857143 0.561224
Precision 0.765625 0.344262 0.632184
F1 0.604938 0.491228 0.594595

Observations¶

The logistic regressions with different thresholds show that changing the threshold can impact the accuracy, recall, precision, and F1 scores of the model.

The default threshold of 0.5 gives an accuracy of 95.19%, a recall of 50.00%, a precision of 76.56%, and an F1 score of 60.49%. Overall accuracy is high, but the model identifies only half of the actual loan takers; the threshold is too conservative for our goal of finding potential customers.

With a threshold of 0.035, accuracy drops to 86.94% while recall rises to 85.71%, but precision falls to 34.43% (F1 of 49.12%). The model now catches most of the positive cases, at the cost of flagging many customers who will not actually take a loan.

With a threshold of 0.35, accuracy is 94.37%, recall 56.12%, precision 63.22%, and F1 59.46%. This threshold strikes a reasonable balance between the two extremes, identifying more positive cases than the default without sacrificing much precision.
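The pattern above can be reproduced on toy data: lowering the threshold trades precision for recall. A minimal sketch with hypothetical scores and labels (not the bank's data):

```python
import numpy as np

# hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.02, 0.03, 0.10, 0.20, 0.40, 0.60, 0.05, 0.45, 0.70, 0.90])

def recall_precision(threshold):
    pred = y_score > threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

for t in (0.5, 0.035, 0.35):
    r, p = recall_precision(t)
    print(f"threshold={t}: recall={r:.2f}, precision={p:.2f}")
```

On this toy data, the 0.035 threshold reaches perfect recall with precision of only 0.50, while 0.5 and 0.35 show the same recall/precision trade-off as the real model.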

Model Building: Decision Tree¶

In [ ]:
tree_model = DecisionTreeClassifier(
    criterion="gini", random_state=1
)
tree_model.fit(x_train, y_train)
Out[ ]:
DecisionTreeClassifier(random_state=1)
In [ ]:
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1, },
        index=[0],
    )

    return df_perf
In [ ]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) +
             "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [ ]:
confusion_matrix_sklearn(tree_model, x_train, y_train)

Train performance

In [ ]:
decision_tree_perf_train = model_performance_classification_sklearn(
    tree_model, x_train, y_train)
decision_tree_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0

Observations¶

  • The model makes no errors on the training set: accuracy, recall, precision, and F1 are all 1.0.
  • This is a classic sign of overfitting; the tree has memorized the training data and needs to be validated on the test set.

Checking model performance on test¶

In [ ]:
confusion_matrix_sklearn(tree_model, x_test, y_test)

Test performance

In [ ]:
decision_tree_perf_test = model_performance_classification_sklearn(
    tree_model, x_test, y_test)
decision_tree_perf_test
Out[ ]:
Accuracy Recall Precision F1
0 0.984234 0.857143 0.923077 0.888889

Visualizing the decision tree¶

In [ ]:
feature_names = list(x.columns)
plt.figure(figsize=(20, 30))
tree.plot_tree(tree_model, feature_names=feature_names,
               filled=True, fontsize=9, node_ids=True, class_names=True)
plt.show()
In [ ]:
tree_model.tree_.node_count
Out[ ]:
101
In [ ]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print(
    pd.DataFrame(
        tree_model.feature_importances_, columns=["Imp"], index=x_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.365668
Education           0.313546
Family              0.118694
CCAvg               0.094199
Age                 0.030165
CD_Account          0.024784
ZIPCode             0.024235
Experience          0.009678
Online              0.009392
Mortgage            0.005886
CreditCard          0.002337
Securities_Account  0.001417
In [ ]:
importances = tree_model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)),
         importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • According to the decision tree model, Income is the most important variable for predicting Personal_Loan.

Model Performance Improvement: Decision Tree¶

Using GridSearch for Hyperparameter tuning of our tree model¶

In [ ]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(6, 15),
    "min_samples_leaf": [1, 2, 5, 7, 10],
    "max_leaf_nodes": [2, 3, 5, 10],
}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

estimator.fit(x_train, y_train)  # fit the best estimator on the training data
Out[ ]:
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=10, min_samples_leaf=7,
                       random_state=1)

Tuned hyperparameters¶

In [ ]:
confusion_matrix_sklearn(estimator, x_train, y_train)

Training performance

In [ ]:
decision_tree_estimator_train = model_performance_classification_sklearn(
    estimator, x_train, y_train)
decision_tree_estimator_train
Out[ ]:
Accuracy Recall Precision F1
0 0.988406 0.891192 0.919786 0.905263
In [ ]:
confusion_matrix_sklearn(estimator, x_test, y_test)

Test performance

In [ ]:
decision_tree_estimator_test = model_performance_classification_sklearn(
    estimator, x_test, y_test)
decision_tree_estimator_test
Out[ ]:
Accuracy Recall Precision F1
0 0.981982 0.826531 0.920455 0.870968

Visualizing the Decision Tree¶

In [ ]:
plt.figure(figsize=(17, 15))

tree.plot_tree(estimator, feature_names=feature_names,
               filled=True, fontsize=9, node_ids=True, class_names=True)
plt.show()
In [ ]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )

print(pd.DataFrame(estimator.feature_importances_, columns=[
      "Imp"], index=x_train.columns).sort_values(by='Imp', ascending=False))

# After tuning, importance is concentrated in fewer features; the rest drop to zero
                         Imp
Income              0.408582
Education           0.365331
Family              0.131687
CCAvg               0.069162
CD_Account          0.025238
Age                 0.000000
Experience          0.000000
ZIPCode             0.000000
Mortgage            0.000000
Securities_Account  0.000000
Online              0.000000
CreditCard          0.000000
In [ ]:
importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)),
         importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Cost Complexity Pruning¶

In [ ]:
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(x_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
Out[ ]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000154 0.000615
2 0.000279 0.001731
3 0.000297 0.002326
4 0.000307 0.003554
5 0.000322 0.004520
6 0.000334 0.006189
7 0.000429 0.006619
8 0.000483 0.007102
9 0.000491 0.007593
10 0.000506 0.009617
11 0.000515 0.010132
12 0.000515 0.010648
13 0.000551 0.011751
14 0.000580 0.012330
15 0.000742 0.013815
16 0.000773 0.014588
17 0.000854 0.015442
18 0.000907 0.016348
19 0.001204 0.017552
20 0.001400 0.021752
21 0.002460 0.024212
22 0.002594 0.026806
23 0.004148 0.030953
24 0.009363 0.040317
25 0.011399 0.051715
26 0.032437 0.116588
In [ ]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
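For reference, minimal cost-complexity pruning selects the subtree $T$ minimizing a penalized impurity measure (sketch of the standard formulation):

```latex
R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|
```

where $R(T)$ is the total impurity of the leaves, $|\widetilde{T}|$ is the number of leaves, and $\alpha \geq 0$ is the complexity parameter; larger $\alpha$ penalizes larger trees and so prunes more aggressively.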

In [ ]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    # fit a decision tree on the training data for each candidate alpha
    clf.fit(x_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.032436553118827774

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.

In [ ]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

Accuracy vs alpha for training and testing sets¶

When ccp_alpha is set to zero, keeping the other default parameters of DecisionTreeClassifier, the tree overfits, reaching 100% training accuracy while test accuracy is lower (about 98.4% here). As alpha increases, more of the tree is pruned, creating a decision tree that generalizes better.

In [ ]:
train_scores = [clf.score(x_train, y_train) for clf in clfs]
test_scores = [clf.score(x_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(10, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
In [ ]:
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy of best model: ', best_model.score(x_train, y_train))
print('Test accuracy of best model: ', best_model.score(x_test, y_test))
DecisionTreeClassifier(random_state=1)
Training accuracy of best model:  1.0
Test accuracy of best model:  0.9842342342342343

Since accuracy isn't the right metric for this problem, we want high recall¶

In [ ]:
recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(x_train)
    values_train = metrics.recall_score(y_train, pred_train3)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test3 = clf.predict(x_test)
    values_test = metrics.recall_score(y_test, pred_test3)
    recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
In [ ]:
# creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(random_state=1)

Post-pruned decision tree¶

In [ ]:
confusion_matrix_sklearn(best_model, x_train, y_train)

Train performance

In [ ]:
decision_tree_tune_perf_train = model_performance_classification_sklearn(
    best_model, x_train, y_train)
decision_tree_tune_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0

Checking model performance on test¶

In [ ]:
confusion_matrix_sklearn(best_model, x_test, y_test)

Test performance

In [ ]:
decision_tree_tune_perf_test = model_performance_classification_sklearn(
    best_model, x_test, y_test)
decision_tree_tune_perf_test
Out[ ]:
Accuracy Recall Precision F1
0 0.984234 0.857143 0.923077 0.888889

Visualizing the Decision Tree¶

In [ ]:
plt.figure(figsize=(20, 30))

tree.plot_tree(best_model, feature_names=feature_names,
               filled=True, fontsize=9, node_ids=True, class_names=True)
plt.show()
In [ ]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )

print(pd.DataFrame(best_model.feature_importances_, columns=[
      "Imp"], index=x_train.columns).sort_values(by='Imp', ascending=False))
                         Imp
Income              0.365668
Education           0.313546
Family              0.118694
CCAvg               0.094199
Age                 0.030165
CD_Account          0.024784
ZIPCode             0.024235
Experience          0.009678
Online              0.009392
Mortgage            0.005886
CreditCard          0.002337
Securities_Account  0.001417
In [ ]:
best_model.tree_.node_count
Out[ ]:
101
In [ ]:
importances = best_model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)),
         importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Observation¶

  • Income and Education are the most important features.
In [ ]:
models_train_comp_df = pd.concat(
    [decision_tree_perf_train.T, decision_tree_estimator_train.T,
        decision_tree_tune_perf_train.T], axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn", "Decision Tree Tuned hyperparameters", "Decision Tree Cost Complexity Pruning"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[ ]:
Decision Tree sklearn Decision Tree Tuned hyperparameters Decision Tree Cost Complexity Pruning
Accuracy 1.0 0.988406 1.0
Recall 1.0 0.891192 1.0
Precision 1.0 0.919786 1.0
F1 1.0 0.905263 1.0
In [ ]:
models_train_comp_df = pd.concat(
    [decision_tree_perf_test.T, decision_tree_estimator_test.T,
        decision_tree_tune_perf_test.T], axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn", "Decision Tree Tuned hyperparameters", "Decision Tree Cost Complexity Pruning"]
print("Test performance comparison:")
models_train_comp_df
Test performance comparison:
Out[ ]:
Decision Tree sklearn Decision Tree Tuned hyperparameters Decision Tree Cost Complexity Pruning
Accuracy 0.984234 0.981982 0.984234
Recall 0.857143 0.826531 0.857143
Precision 0.923077 0.920455 0.923077
F1 0.888889 0.870968 0.888889

Model Comparison and Final Model Selection¶

Training

image.png image-2.png

Test

image.png image-3.png

Actionable Insights and Business Recommendations¶

What recommendations would you suggest to the bank?

The feature importances show that the following features are the most important for predicting whether a customer will take out a loan:

  • Income: Customers with higher incomes are more likely to take out loans. This is because they have more disposable income and can afford to make loan payments.
  • Education: Customers with higher levels of education are also more likely to take out loans. This is because they are more likely to be in high-paying jobs and have the financial knowledge to manage a loan.
  • Family: Customers with families are more likely to take out loans. This is because they may need money to pay for things like child care, education, or a new home.
  • CCAvg: average monthly credit card spending (in thousands of dollars). Customers with higher credit card spending are more likely to take out loans, since they already borrow and repay regularly, which makes them a lower risk for lenders.
  • Age: Customers who are older are more likely to take out loans. This is because they may need money to retire, pay for medical expenses, or start a business.

The other features are less important for predicting whether a customer will take out a loan. However, they can still be used to improve the accuracy of the model.

Here are some additional insights that can be gained from the feature importances:

  • Income and education are the most important factors for predicting whether a customer will take out a loan. This is because these factors are closely related to a customer's ability to repay a loan.
  • Family size is also a significant factor. This is because larger families tend to have higher expenses, which can make it more difficult to save money for a down payment on a home or other major purchase.
  • Credit card average is a good indicator of a customer's creditworthiness. This is because it shows how well a customer has managed their credit in the past.
  • Age is a less important factor, but it can still be used to improve the accuracy of the model. This is because older customers are more likely to have accumulated assets, such as a home or retirement savings, which can make them a lower risk for lenders.

Decision tree vs logistic regression¶

The comparison between logistic regression and decision tree shows that they have different strengths and weaknesses.

Logistic regression is a linear model that predicts the probability of an event occurring; in this case, a customer taking out a loan. It is easy to interpret, since each coefficient maps directly to a feature's effect on the log-odds, but it can only capture linear decision boundaries and may underperform when the relationships are non-linear.

Decision trees are non-linear models for the same task. They are easy to train, capture non-linear relationships and feature interactions automatically, and here achieve higher accuracy than logistic regression. However, an unpruned tree overfits and generalizes poorly to new data, which is why pruning or hyperparameter tuning is needed.

Ultimately, the best model for AllLife Bank depends on their priorities. If interpretability of individual feature effects matters most, logistic regression may be the better option; if predictive performance matters most, the tuned decision tree is the better choice.

image.png

As the comparison shows, the decision tree achieves higher accuracy, recall, and F1 score than logistic regression, while logistic regression remains easier to interpret.